Abstract

Many businesses are using recommender systems for marketing outreach. Recommendation
algorithms can be either based on content or driven by collaborative filtering. We study
different ways to incorporate content information directly into the matrix factorization
approach of collaborative filtering. These content-boosted matrix factorization algorithms
not only improve recommendation accuracy, but also provide useful insights about the
contents, as well as make recommendations more easily interpretable.

1 Introduction



Many businesses today are using the Internet to promote and to sell their products and 
services. Through the Internet, businesses can easily market many items to a large number 
of consumers. With a vast number of items, however, consumers may be overwhelmed by 
their choices. That is why, in an effort to maintain customer satisfaction and loyalty, many 
businesses have also integrated the use of recommender systems in their marketing strategies. 
For example, the online store www.amazon.com will suggest, based on a user's past purchases,
products that he or she may be interested in.

Recommender systems today typically use one of two approaches: the content-based
approach, or the collaborative filtering (CF) approach. In the content-based approach (e.g.,
Pandora, www.pandora.com), a profile is created for each user and for each item. The user
profile describes the contents that he or she likes, and the item profile describes the contents
that it contains. To a given user, the system recommends items that match his or her
profile. In the CF approach (e.g., Netflix, www.netflix.com), users who have rated the same
items similarly are considered to have similar preferences overall. To a given user, the system
recommends items that similar users have rated favorably before.

For an extensive review and discussion of different CF algorithms as well as an up-to-date
and comprehensive bibliography, we refer the readers to a recent article by Feuerverger et al. [1].

While various algorithms have been adapted for the recommendation problem, including
restricted Boltzmann machines, most CF algorithms can be classified into two broad
categories [3]: those based on nearest neighbors and those based on matrix factorization.
While the nearest-neighbor approach is more intuitive, the matrix-factorization approach
has gained popularity as a result of the Netflix contest.



1.1 Focus of paper 

Perhaps the most important lessons from the Netflix contest are that, in terms of prediction
accuracy, it is often difficult for any single algorithm to outperform an ensemble of many
different algorithms [1, 0], and that algorithms of different flavors, when taken alone, often
have similar predictive power for a given problem.

For example, as shown by Feuerverger et al. [1] in their Table 1, on the Netflix data, a
neighborhood-based method alone (labelled "kNN" in their table) had a root mean squared
error (RMSE) of 0.9174 whereas a method based on matrix factorization alone (labelled
"SVD" in their table) had an RMSE of 0.9167, very close indeed. A significant drop in
the RMSE (to 0.8982) was achievable only when the two classes of methods were combined;
see also Koren [3]. And it is widely known that the ultimate winner in the Netflix
contest (with an RMSE of 0.8572) was an ensemble of no fewer than 800 different algorithms.

Therefore, a research project on CF can either focus on new classes of CF algorithms 
that are fundamentally different from existing ones, or focus on improvements or extensions 
within a certain class. For projects of the first type, the key question is whether the new class 
of algorithms is better than other classes. For those of the second type, the key question is 
whether the proposed extension adds any value when compared with baseline algorithms in 
the same class. The research we will report in this paper is strictly of the second type. In 
particular, we focus on the matrix factorization approach only. 



1.2 The "cold start" problem 

One advantage of the CF approach is that it does not require extra information on the users
or the items; thus, it is capable of recommending an item without understanding the item
itself. However, this very advantage is also the root cause of the so-called "cold start"
problem, which refers to the general difficulty in performing CF for users and items that are 
relatively new. By definition, newer users are those who have not rated many items, so it is 
difficult to find other users with similar preferences. Likewise, newer items are those which 
have not been rated by many users, so it is difficult to recommend them to anyone. 

Various ideas have been proposed to deal with the "cold start" problem. Park et al. [6]
suggested using so-called "filterbots": artificial items or users inserted into the system
with pre-defined characteristics. For instance, an action-movie filterbot can make
recommendations to new users who have only liked one or two action movies. More recently,
Zhao et al. [7] suggested shared CF, an ensemble technique that aggregates predictions from
several different systems. Since one recommender system may have data on user-item pairs
that another one does not, it is possible to improve recommendations by sharing information
across different systems.

Another common approach for dealing with the "cold start" problem is to fill in the missing
ratings with "pseudo" ratings before applying CF. For example, Goldberg et al. [8] did this
with principal component analysis; Nguyen et al. [9] did this with rule-based induction; while
Melville et al. [10] did this with a hybrid, two-step approach, creating "pseudo" ratings with
a content-based classifier.



1.3 Objectives and contributions 



The key idea behind the hybrid approach is to leverage supplemental information. Many
recent works have taken this basic idea to new heights, successfully exploiting supplemental
information from different sources and in various forms, for example, tagging history [12],
personality traits [13, 14], social networks [15, 16], and Wikipedia articles [17].
In this paper, we focus on a particular type of supplemental information: content
information about the individual items. For example, for recipes we may know their
ingredient lists; for movies we may know their genres. Moreover, we focus on ways to take
advantage of such content information directly in the matrix factorization approach, not by
using a hybrid or two-step algorithm. We refer to our suite of algorithms as "content-boosted
matrix factorization algorithms".

Not only can these content-boosted algorithms achieve improved recommendation
accuracy (Section 4), they can also produce more interpretable recommendations (Section 5),
as well as furnish useful insights about the contents themselves that are otherwise
unavailable (Section 5.2). More interpretable recommendations are becoming ever more desirable
commercially, because users are more likely to act on a recommendation if they understand
why it is being made to them [19, 20], while better understandings of contents can facilitate
the creation of new products, such as recipes with substitute ingredients.



1.4 Outline 

We proceed as follows. In Section 2, we give a brief review of the matrix factorization (MF)
approach for collaborative filtering. In Section 3, we present a number of different
content-boosted MF algorithms. In Section 4, we describe the data sets we used and the experiments
we performed to study and evaluate various algorithms. In Section 5, we discuss useful
by-products from these content-boosted MF techniques. We end in Section 6 with a brief
summary.



2 Matrix factorization: A brief review 

Before we start, it is necessary to review the basic matrix factorization method briefly. Our
review follows the work of Koren et al. [4].

2.1 Notation

Given a set of users $U = \{u_1, \dots, u_N\}$ and a set of items $I = \{i_1, \dots, i_M\}$, let $r_{ui}$
denote the rating given by user $u$ to item $i$. These ratings form a user-item rating matrix,
$R = [r_{ui}]_{N \times M}$. In principle, $r_{ui}$ can take on any real value but, in practice, $r_{ui}$ is
typically binary, indicating "like" and "dislike", or integer-valued in a certain range,
indicating different levels of preferences, e.g., $r_{ui} \in \{1, \dots, 5\}$.

Often, the rating matrix $R$ is highly sparse with many unknown entries, as users typically
are only able to rate a small fraction of the items; recall the "cold start" problem discussed
briefly in Section 1.2. We denote

\[ T = \{(u, i) : r_{ui} \text{ is known}\} \]

as the set of indices for known ratings. Given an unknown (user, item)-pair, $(u, i) \notin T$, the
goal of the recommender system is to predict the rating that user $u$ would give to item $i$,
which we denote by $\hat{r}_{ui}$. Furthermore, we define

\[ T_{u\cdot} = \{i : (u, i) \in T\} \]

to be the set of items that have been rated by user $u$, and

\[ T_{\cdot i} = \{u : (u, i) \in T\} \]

to be the set of users who have rated item $i$.

2.2 Normalization by ANOVA 

Despite its overwhelming simplicity, an ANOVA-type of model often captures a fair amount
of information in the rating data [1, 3]. The simplest ANOVA-type model used in the
literature consists of just main effects, i.e.,

\[ r_{ui} = \mu + \alpha_u + \beta_i + \epsilon_{ui}, \tag{1} \]

where $\epsilon_{ui}$ is white noise, $\mu$ is the overall mean, $\alpha_u$ represents a user-effect, and
$\beta_i$ represents an item-effect. These two main effects capture the obvious fact that some
items are simply better liked than others, while some users are simply more difficult to please.

It is common in the literature to normalize the rating matrix $R$ by removing such an
ANOVA-type model before applying any matrix-factorization (or nearest-neighbor) methods
[e.g., 3]. In all of our experiments reported below, we followed this common practice, that is,
all matrix-factorization algorithms were applied to $r_{ui} - \hat{\mu} - \hat{\alpha}_u - \hat{\beta}_i$, and the
predicted rating was actually $\hat{r}_{ui} + \hat{\mu} + \hat{\alpha}_u + \hat{\beta}_i$, where $\hat{r}_{ui}$ was the prediction
from the matrix-factorization algorithm, and $\hat{\mu}$, $\hat{\alpha}_u$, $\hat{\beta}_i$ were the MLEs of $\mu$, $\alpha_u$,
$\beta_i$. In order not to further complicate our notation, however, this detail will be suppressed
in our presentation, and we still use the notations $r_{ui}$ and $R$, despite the normalization step.
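The normalization step can be sketched as follows. This is only an illustration under stated assumptions: the function names `fit_main_effects` and `normalize` are hypothetical, ratings are held in a dictionary keyed by (user, item), and a simple backfitting loop (alternately averaging residuals over users and over items, with the overall mean fixed at the grand mean) stands in for the least-squares fit of model (1); the text does not say how the estimates were actually computed.

```python
from collections import defaultdict


def fit_main_effects(ratings, n_sweeps=50):
    """Estimate mu, alpha_u, beta_i in r_ui = mu + alpha_u + beta_i + noise.

    `ratings` maps (user, item) -> rating for the known pairs only.
    Backfitting is an assumed stand-in for the least-squares fit.
    """
    mu = sum(ratings.values()) / len(ratings)
    alpha = defaultdict(float)
    beta = defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for (u, i) in ratings:
        by_user[u].append(i)
        by_item[i].append(u)
    for _ in range(n_sweeps):
        # User effects: mean residual of each user given current item effects.
        for u, items in by_user.items():
            alpha[u] = sum(ratings[u, i] - mu - beta[i] for i in items) / len(items)
        # Item effects: mean residual of each item given current user effects.
        for i, users in by_item.items():
            beta[i] = sum(ratings[u, i] - mu - alpha[u] for u in users) / len(users)
    return mu, dict(alpha), dict(beta)


def normalize(ratings, mu, alpha, beta):
    """Residual ratings to which a matrix-factorization algorithm would be applied."""
    return {(u, i): r - mu - alpha[u] - beta[i] for (u, i), r in ratings.items()}
```

Predictions from the factorization would then be mapped back to the rating scale by adding `mu + alpha[u] + beta[i]`.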

2.3 Matrix factorization 

To predict unknown ratings in $R$, the matrix factorization approach uses all the known
ratings to decompose the matrix $R$ into the product of two low-rank, latent feature matrices,
one for the users, $P_{N \times K}$, and another for the items, $Q_{M \times K}$, so that

\[
R \approx \hat{R} = PQ^T =
\underbrace{\begin{bmatrix} p_1^T \\ p_2^T \\ \vdots \\ p_N^T \end{bmatrix}}_{N \times K}
\underbrace{\begin{bmatrix} q_1 & q_2 & \cdots & q_M \end{bmatrix}}_{K \times M}.
\tag{2}
\]

The latent feature vectors, $p_u$ for user $u$ ($u = 1, 2, \dots, N$) and $q_i$ for item $i$
($i = 1, 2, \dots, M$), are $K$-dimensional, where $K \ll \min\{M, N\}$ is pre-specified. The predicted
rating for the user-item pair $(u, i)$ is simply

\[ \hat{r}_{ui} = p_u^T q_i. \]

Intuitively, one can imagine a $K$-dimensional map, in which $p_u$ and $q_i$ are the (latent)
coordinates for user $u$ and item $i$, respectively, and all the information that we need in order
to make recommendations is contained in such a map: users will generally like items that
are nearby. Latent-coordinate models have a long history, e.g., principal component analysis,
factor analysis, multidimensional scaling, and so on [see, e.g., 21].

Mathematically, the factorization (2) can be achieved by solving the optimization problem

\[ \min_{P, Q} \; \|R - PQ^T\|^2, \tag{3} \]

where $\|\cdot\|$ is the Frobenius norm. To prevent over-fitting, it is common to include a
regularization penalty on the sizes of $P$ and $Q$, turning the optimization problem above into

\[ \min_{P, Q} \; \|R - PQ^T\|^2 + \lambda \left( \|P\|^2 + \|Q\|^2 \right). \tag{4} \]

From a Bayesian point of view, the first part of the objective function (4) can be viewed as
coming from a Gaussian likelihood function; the regularization penalties can be viewed as
coming from spherical Gaussian priors on the user and item feature vectors; and the solution
to the optimization problem itself is then the so-called maximum a posteriori (MAP) estimate.


2.4 Relative scaling of penalty terms 

Feuerverger et al. [1] used empirical Bayes analysis to argue that one should, in principle,
always penalize $\|p_u\|^2$ and $\|q_i\|^2$ by different amounts. In practice, their advice is not
always followed because the extra computational burden to select two tuning parameters rather
than one is substantial, and the resulting payoff in terms of performance improvement may
not be significant.

In our work, we found it convenient to scale the second penalty term, the one on $\|Q\|^2$,
by a factor $\gamma > 0$ such that, regardless of how many users ($N$) and how many items ($M$)
there are, the penalty on $\|Q\|^2$ is always on the same order of magnitude as the penalty on
$\|P\|^2$. We will come back to this point later (Section 4.1).

Furthermore, since most entries in $R$ are unknown, we can only evaluate the first term in
(4) over known entries $(u, i) \in T$. This means the optimization problem actually solved in
practice is:

\[
\min_{P, Q} \; L_{BL}(P, Q) = \sum_{(u, i) \in T} \left( r_{ui} - p_u^T q_i \right)^2
+ \lambda \left( \sum_u \|p_u\|^2 + \gamma \sum_i \|q_i\|^2 \right).
\tag{5}
\]

The subscript "BL" stands for "baseline". For the purpose of comparison, we will refer to
this method below as the baseline matrix factorization method, or simply the baseline (BL)
algorithm.
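As a concrete reading of Eq. (5), the baseline objective can be evaluated as in the sketch below. The function name `baseline_loss` and the data layout (a dictionary of known ratings keyed by zero-based (u, i), with P and Q stored as lists of K-dimensional rows) are assumptions made for illustration only.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def baseline_loss(ratings, P, Q, lam, gamma):
    """Baseline objective L_BL of Eq. (5).

    ratings : dict mapping (u, i) -> r_ui, known pairs only (the set T)
    P, Q    : lists of latent vectors p_u and q_i (rows of P and Q)
    lam     : regularization parameter lambda
    gamma   : relative scaling of the item-side penalty
    """
    # Squared error over known ratings only.
    fit = sum((r - dot(P[u], Q[i])) ** 2 for (u, i), r in ratings.items())
    # Penalty on the user vectors plus the rescaled penalty on the item vectors.
    penalty = sum(dot(p, p) for p in P) + gamma * sum(dot(q, q) for q in Q)
    return fit + lam * penalty
```

Unknown entries of R never enter the first term; they are regularized only through the penalties on the corresponding latent vectors.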

2.5 Alternating gradient descent 

With both $P$ and $Q$ being unknown, the optimization problem is not convex. It can be
solved using an alternating gradient descent algorithm [3], moving along the gradient with
respect to $p_u$ while keeping $q_i$ fixed, and vice versa.

Let $\nabla_u^{BL}$ denote the derivative of $L_{BL}$ with respect to $p_u$ and $\nabla_i^{BL}$ its derivative
with respect to $q_i$. Then,

\[ \nabla_u^{BL} \propto \sum_{i \in T_{u\cdot}} -\left( r_{ui} - p_u^T q_i \right) q_i + \lambda p_u, \tag{6} \]

\[ \nabla_i^{BL} \propto \sum_{u \in T_{\cdot i}} -\left( r_{ui} - p_u^T q_i \right) p_u + \lambda \gamma q_i, \tag{7} \]

for every $u = 1, 2, \dots, N$ and $i = 1, 2, \dots, M$. At iteration $(j + 1)$, the updating equations
for $p_u$ and $q_i$ are:

\[ p_u^{(j+1)} = p_u^{(j)} - \eta \nabla_u^{BL}\left( p_u^{(j)}, q_i^{(j)} \right), \tag{8} \]

\[ q_i^{(j+1)} = q_i^{(j)} - \eta \nabla_i^{BL}\left( p_u^{(j)}, q_i^{(j)} \right), \tag{9} \]

where $\eta$ is the step size or learning rate. The algorithm is typically initialized with small
random entries for $p_u$ and $q_i$, and iteratively updated over all $u = 1, \dots, N$ and
$i = 1, \dots, M$ until convergence (see Algorithm 1). We will say more about initialization later
(Section 4.5).

Algorithm 1 Alternating Gradient Descent Algorithm for Optimizing $L_{BL}$, Eq. (5)

Input: $R = [r_{ui}]_{N \times M}$, $K$
Output: $P$, $Q$

initialize $j \leftarrow 0$ and choose $P^{(0)}$, $Q^{(0)}$ (see Section 4.5)
repeat
    for all $u = 1, \dots, N$ and $i = 1, \dots, M$ do
        compute $\nabla_u^{BL}$ and $\nabla_i^{BL}$ using (6)-(7)
        update $p_u^{(j+1)}$ and $q_i^{(j+1)}$ with (8)-(9)
    end for
until $[L_{BL}(P^{(j)}, Q^{(j)}) - L_{BL}(P^{(j+1)}, Q^{(j+1)})] / L_{BL}(P^{(j)}, Q^{(j)}) < \epsilon$
return $P$, $Q$
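One possible reading of Algorithm 1 is sketched below. Several details are assumptions rather than statements from the text: the name `fit_baseline`, the uniform initialization on (-0.1, 0.1), the default values of the tuning parameters, the iteration cap, and the choice to evaluate all gradients at the current (P, Q) before updating them simultaneously. The scaling factor is set to N/M, as used for BL later in the paper.

```python
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def baseline_loss(ratings, P, Q, lam, gamma):
    """Objective L_BL of Eq. (5), evaluated over known ratings only."""
    fit = sum((r - dot(P[u], Q[i])) ** 2 for (u, i), r in ratings.items())
    pen = sum(dot(p, p) for p in P) + gamma * sum(dot(q, q) for q in Q)
    return fit + lam * pen


def fit_baseline(ratings, N, M, K, lam=0.1, eta=0.01, eps=1e-6,
                 max_iter=5000, seed=0):
    """Alternating gradient descent for L_BL; returns P, Q and the loss history."""
    rng = random.Random(seed)
    gamma = N / M
    # Small random starting values (an assumed initialization).
    P = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(N)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(M)]
    history = [baseline_loss(ratings, P, Q, lam, gamma)]
    for _ in range(max_iter):
        # Gradients (6) and (7): start from the penalty parts ...
        grad_P = [[lam * x for x in p] for p in P]
        grad_Q = [[lam * gamma * x for x in q] for q in Q]
        # ... then accumulate the data parts over known ratings.
        for (u, i), r in ratings.items():
            err = r - dot(P[u], Q[i])
            for k in range(K):
                grad_P[u][k] -= err * Q[i][k]
                grad_Q[i][k] -= err * P[u][k]
        # Updates (8) and (9) with step size eta.
        P = [[x - eta * g for x, g in zip(p, gp)] for p, gp in zip(P, grad_P)]
        Q = [[x - eta * g for x, g in zip(q, gq)] for q, gq in zip(Q, grad_Q)]
        history.append(baseline_loss(ratings, P, Q, lam, gamma))
        # Stop when the relative decrease of the objective falls below eps.
        if (history[-2] - history[-1]) / history[-2] < eps:
            break
    return P, Q, history
```

Because the problem is not convex, the result depends on the starting values; the fixed seed only makes this sketch reproducible.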
2.6 SVD and other matrix factorization techniques

In the CF literature, the matrix factorization approach outlined above is often dubbed the
"singular value decomposition (SVD) approach" [see, e.g., 1, 0, 23]. Strictly speaking, this
is a bit misleading. The SVD is perhaps the single most widely used matrix factorization
technique in all of applied mathematics; it solves the following problem:

\[
\begin{aligned}
\min_{P_*, D_*, Q_*} \;& \|R - P_* D_* Q_*^T\|^2 \\
\text{s.t. } & D_* \text{ is diagonal with rank } K, \\
& P_*^T P_* = I \text{ and } Q_*^T Q_* = I.
\end{aligned}
\tag{10}
\]

By letting $P = P_* D_*^{1/2}$ and $Q = Q_* D_*^{1/2}$, SVD would give us

\[ R \approx PQ^T \tag{11} \]

such that $P^T P = D_*^{1/2} P_*^T P_* D_*^{1/2} = D_*$ is diagonal, meaning that the columns of $P$ are
mutually orthogonal, and likewise for $Q$. However, the matrix factorization approach outlined
above does not require the columns of either $P$ or $Q$ to be orthogonal. To be sure, we confirmed
this directly with the winners of the Netflix contest [23], who used this technique pervasively
in their work. Without the orthogonality constraints, this would certainly raise identifiability
and degeneracy questions for the optimization problem (5), but these problems can be avoided
in practice by carefully initializing the alternating gradient descent algorithm; we elaborate
on this detail in Section 4.5 below.

Lee and Seung [25] popularized another matrix factorization technique called the
non-negative matrix factorization (NMF), which is (3) with the additional non-negativity
constraints that

\[ p_{uk} \ge 0 \text{ and } q_{ik} \ge 0 \text{ for all } k. \]

The NMF has been used to analyze a wide variety of data such as images [25] and gene
expressions [26] to reveal interesting underlying structure. In recent years, it has also been
used to perform CF [e.g., 27, 28] even though finding underlying structures in the data is
often not the primary goal for CF. Matrix factorization with either orthogonality constraints
(e.g., SVD) or nonnegativity constraints (e.g., NMF) is more sound mathematically, since
the problem is somewhat ill-defined without any constraints. However, we will still focus
only on the unconstrained version outlined above (Section 2.3) since it remains the most
dominant in the CF community, owing partly to its wide use in the three-year-long Netflix
contest.



3 Content-boosted matrix factorization 

Now, suppose that, for each item $i$, there is a content vector $a_i = [a_{i1}, \dots, a_{iD}]$ of $D$
attributes. Stacking these vectors together gives an attribute matrix, $A = [a_{id}]_{M \times D}$. For
simplicity, we assume that all entries in $A$ are binary, i.e., $a_{id} \in \{0, 1\}$, each indicating
whether item $i$ possesses attribute $d$. In what follows, we study and compare different ways of
incorporating this type of content information directly into the matrix factorization approach.
We present two classes of methods with slightly different flavors. One class uses extra
penalties with selective shrinkage effects (Section 3.1), and the other uses direct regression
constraints (Section 3.2).

3.1 Alignment-biased factorization

To incorporate $A$ into the matrix factorization approach, one idea is as follows: if two items
$i$ and $i'$ share at least $c$ attributes in common (call this the "common attributes" condition),
then it makes intuitive sense to require that their feature vectors, $q_i$ and $q_{i'}$, be "close" in
the latent space.



3.1.1 Details 

For the matrix factorization approach, it is clear from (2) that the notion of closeness is
modeled mathematically by the inner product in the latent feature space. Therefore, to say
that $q_i$ and $q_{i'}$ are "close" means that their inner product, $q_i^T q_{i'}$, is large. We can
incorporate this preference by adding another penalty, which we call the "alignment penalty",
to the optimization problem (5).

For binary $a_{id}$, the "common attributes" condition is easily expressed by $a_i^T a_{i'} \ge c$. Let

\[ S_c(i) = \{ i' : i' \ne i \text{ and } a_i^T a_{i'} \ge c \}. \]

We solve the following optimization problem:

\[
\min_{P, Q} \; L_{AB}(P, Q) = L_{BL}(P, Q)
- \underbrace{\lambda \gamma \sum_{i=1}^{M} \frac{1}{|S_c(i)|} \sum_{i' \in S_c(i)} q_i^T q_{i'}}_{\text{alignment penalty}},
\tag{12}
\]

where $L_{BL}(P, Q)$ is the baseline objective function given by (5), and the notation $|S|$ means
the size of the set $S$. Notice that we make the alignment penalty adaptive to the size of
$S_c(i)$. The subscript "AB" stands for "alignment-biased".
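The sets S_c(i) and the size-adaptive alignment term can be computed as in the following sketch. The names `neighbour_sets` and `alignment_term` are hypothetical, items are indexed from zero, and items with an empty S_c(i) are assumed to contribute nothing, a case the text does not address.

```python
def shared(a, b):
    """Number of attributes shared by two binary attribute vectors (a^T b)."""
    return sum(x * y for x, y in zip(a, b))


def neighbour_sets(A, c):
    """S_c(i) for every item i: all other items sharing at least c attributes."""
    M = len(A)
    return [[j for j in range(M) if j != i and shared(A[i], A[j]) >= c]
            for i in range(M)]


def alignment_term(Q, S):
    """Sum over items of the average inner product with their neighbours.

    The alignment-biased objective would subtract lambda * gamma times this
    value from the baseline objective.
    """
    total = 0.0
    for i, nbrs in enumerate(S):
        if nbrs:  # assumed convention: skip items with no neighbours
            total += sum(shared(Q[i], Q[j]) for j in nbrs) / len(nbrs)
    return total
```

Here `shared` doubles as a plain inner product when applied to the latent vectors in Q.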

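It is easy to see that the basic idea of alternating gradient descent still applies. For $L_{AB}$,
the gradient with respect to $p_u$ clearly remains the same, that is,

\[ \nabla_u^{AB} = \nabla_u^{BL}, \]

while the gradient with respect to $q_i$ becomes

\[
\nabla_i^{AB} \propto \sum_{u \in T_{\cdot i}} -\left( r_{ui} - p_u^T q_i \right) p_u
+ \lambda \gamma \left[ q_i - \frac{1}{|S_c(i)|} \sum_{i' \in S_c(i)} q_{i'} \right].
\tag{13}
\]

The updating equations are identical to (8)-(9), except that $\nabla_u^{BL}$ and $\nabla_i^{BL}$ are
replaced by $\nabla_u^{AB}$ and $\nabla_i^{AB}$.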
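The item-side gradient of the alignment-biased objective, Eq. (13), can be sketched as below. The function name `ab_item_gradient`, its argument layout, and the fallback for items with no neighbours (shrinkage toward the origin, as in the baseline) are assumptions for illustration.

```python
def ab_item_gradient(i, ratings_for_item, P, Q, S, lam, gamma):
    """Gradient of L_AB with respect to q_i, up to a positive constant.

    ratings_for_item : dict mapping u -> r_ui for the users who rated item i
    S                : list of neighbour index lists, S[i] playing the role of S_c(i)
    """
    K = len(Q[i])
    grad = [0.0] * K
    # Data term: sum over users who rated item i.
    for u, r in ratings_for_item.items():
        err = r - sum(p * q for p, q in zip(P[u], Q[i]))
        for k in range(K):
            grad[k] -= err * P[u][k]
    # Centroid of the latent vectors of the items in S_c(i).
    if S[i]:
        centroid = [sum(Q[j][k] for j in S[i]) / len(S[i]) for k in range(K)]
    else:
        centroid = [0.0] * K  # assumed fallback: plain shrinkage toward zero
    # Penalty term: pulls q_i toward the centroid rather than toward zero.
    for k in range(K):
        grad[k] += lam * gamma * (Q[i][k] - centroid[k])
    return grad
```

A gradient step along the negative of this vector moves q_i toward the centroid of its neighbours, which is the shrinkage effect discussed next.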



3.1.2 Differential shrinkage effects 

The effect of the alignment penalty can be seen explicitly from (13) as shrinking the latent
vector of each item toward the centroid of items that share a certain number of attributes
with it. This is the selective shrinkage effect that we alluded to earlier (Section 3),
and it plays a central role.

Next, we introduce a generalized/smoothed version of our alignment penalty (Section 3.1.3)
as well as a related but slightly different mathematical formulation (Section 3.1.4). We will
see that the main difference between these methods lies in their respective shrinkage
effects: in each iteration, they shrink towards slightly different centroids and by slightly
different amounts; see the terms inside the square brackets in (13), (16), (18) and (19).



3.1.3 A smooth generalization 

An obvious generalization of the alignment penalty is to change (12) into

\[
\min_{P, Q} \; L_{gAB}(P, Q) = L_{BL}(P, Q)
- \underbrace{\lambda \gamma \sum_{i=1}^{M} \sum_{i'=1}^{M} w(i, i') \, q_i^T q_{i'}}_{\text{gen. alignment penalty}},
\tag{14}
\]

with

\[
w(i, i') \propto \frac{\exp\left[ \theta \left( a_i^T a_{i'} - c \right) \right]}
{1 + \exp\left[ \theta \left( a_i^T a_{i'} - c \right) \right]}.
\tag{15}
\]

The "proportional" relation "$\propto$" in (15) means the weights $w(i, i')$ are typically normalized
to sum to unity, i.e., $\sum_{i'} w(i, i') = 1$ for any given $i$. The alignment penalty used in (12)
corresponds almost everywhere to the special and extreme case of $\theta \to \infty$; for $\theta < \infty$,
$w(i, i')$ is a smooth, monotonic function of the number of attributes shared by items $i$ and
$i'$, rather than an abrupt step function (see Figure 1).



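[Figure 1 plot: $w(s, t)$ on the vertical axis against $a_s^T a_t$ on the horizontal axis, with the
regions $a_s^T a_t < c$, $a_s^T a_t = c$ and $a_s^T a_t > c$ marked.]

Figure 1: The function $w(s, t)$ as given by (15), for $\theta = 0.5, 1, 5, \infty$.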
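The smooth weights of Eq. (15) can be computed as in the sketch below. The name `smooth_weights` is hypothetical, and two choices are assumptions of the sketch rather than of the text: the self-weight w(i, i) is set to zero, and a row whose weights all underflow to zero is left unnormalized. The logistic function is evaluated in a numerically stable form so that large values of theta do not overflow.

```python
import math


def logistic(z):
    """exp(z) / (1 + exp(z)), computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def smooth_weights(A, c, theta):
    """Row-normalized weights w(i, i') for a binary attribute matrix A (M x D)."""
    M = len(A)
    W = []
    for i in range(M):
        row = []
        for j in range(M):
            if j == i:
                row.append(0.0)  # assumed: an item exerts no pull on itself
                continue
            s = sum(x * y for x, y in zip(A[i], A[j]))  # attributes shared
            row.append(logistic(theta * (s - c)))
        total = sum(row)
        # Normalize so the weights for item i sum to one (skip all-zero rows).
        W.append([w / total for w in row] if total > 0 else row)
    return W
```

With a large theta the weights concentrate on items sharing more than c attributes, which is the step-function limit described above; with a small theta they become nearly uniform.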



For $L_{gAB}$ (14), the gradient with respect to $p_u$ again remains the same,
$\nabla_u^{gAB} = \nabla_u^{BL}$, while the gradient with respect to $q_i$ simply becomes

\[
\nabla_i^{gAB} \propto \sum_{u \in T_{\cdot i}} -\left( r_{ui} - p_u^T q_i \right) p_u
+ \lambda \gamma \left[ q_i - \sum_{i'=1}^{M} w(i, i') \, q_{i'} \right].
\tag{16}
\]

Using smoother weights would allow all items that share attributes with $i$ to contribute to
the shrinkage effect, not just those that share at least a certain number of attributes with
it. Moreover, their contributions would be adaptive: the amount of "pull" that item $i'$
exerts on the feature vector of item $i$ depends on how many attributes they share in
common. Depending on how much information there is in the data, this could potentially
enhance the effectiveness of the alignment penalty.



3.1.4 A related method: Tag informed CF 

Noticing that many commercial recommender engines allow users to create personalized tags,
Zhen et al. [12] proposed a method to exploit information from these tags. Following the
work of Li and Yeung [29], their idea was to "make two user-specific latent feature vectors
as similar as possible if the two users have similar tagging history" by adding a tag-based
penalty to the baseline optimization problem:

\[
\min_{P, Q} \; L_{BL}(P, Q)
+ \underbrace{\lambda \sum_{u=1}^{N} \sum_{u'=1}^{N} \|p_u - p_{u'}\|^2 \, w(u, u')}_{\text{tag-based penalty}},
\]

where $w(u, u')$ is a measure of similarity between two users based on their tagging history.
Interestingly, if we replace the word "user" with "item" and the phrase "tagging history"
with "content" or "attributes", the same idea can be applied to items, i.e.,

\[
\min_{P, Q} \; L_{TG}(P, Q) = L_{BL}(P, Q)
+ \lambda \gamma \sum_{i=1}^{M} \sum_{i'=1}^{M} \|q_i - q_{i'}\|^2 \, w(i, i'),
\tag{17}
\]

where $w(i, i')$ is the similarity between two items based on their content information, and
the subscript "TG" stands for "tag", indicating where the original idea came from. But since

\[ \|q_i - q_{i'}\|^2 = \|q_i\|^2 + \|q_{i'}\|^2 - 2 q_i^T q_{i'}, \]

it is easy to see that this leads to a similar but slightly different mathematical formulation,
essentially consisting of

(i) penalizing $\|p_u\|^2$ and $\|q_i\|^2$ by different amounts (even if $\gamma = 1$); in particular, the
penalty in front of $\|q_i\|^2$ is multiplied by $(1 + 2w_{i\cdot})$, where

\[ w_{i\cdot} = \sum_{i'=1}^{M} w(i, i'); \]

and

(ii) using the generalized version of our alignment penalty (14), up to the specific choice
of $w(i, i')$ itself.
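Again, for $L_{TG}$ (17), the gradient with respect to $p_u$ remains the same,
$\nabla_u^{TG} = \nabla_u^{BL}$, while the gradient with respect to $q_i$ becomes

\[
\nabla_i^{TG} \propto \sum_{u \in T_{\cdot i}} -\left( r_{ui} - p_u^T q_i \right) p_u
+ \lambda \gamma \left[ (1 + 2w_{i\cdot}) \, q_i - 2 \sum_{i'=1}^{M} w(i, i') \, q_{i'} \right].
\tag{18}
\]

We can see that, when compared with (16), the selective shrinkage effect is somewhat
attenuated in (18). This is most clearly seen if we normalize the weights to sum to one, i.e.,
$w_{i\cdot} = \sum_{i'=1}^{M} w(i, i') = 1$. Then, (18) simply becomes

\[
\nabla_i^{TG} \propto \sum_{u \in T_{\cdot i}} -\left( r_{ui} - p_u^T q_i \right) p_u
+ 3 \lambda \gamma \left[ q_i - \frac{2}{3} \sum_{i'=1}^{M} w(i, i') \, q_{i'} \right].
\tag{19}
\]

Equation (19) reveals a curious factor of 2/3 in front of the weighted centroid, which clearly
dampens this algorithm's corresponding shrinkage effect.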
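The factor of 2/3 follows from a one-line rearrangement of the bracketed term in (18). The worked step below assumes only that the weights are normalized so that $w_{i\cdot} = 1$:

```latex
% Bracketed term of (18) with w_{i\cdot} = 1, rewritten in the form used in (19).
\begin{align*}
(1 + 2 w_{i\cdot})\, q_i - 2 \sum_{i'=1}^{M} w(i,i')\, q_{i'}
  &= 3\, q_i - 2 \sum_{i'=1}^{M} w(i,i')\, q_{i'} \\
  &= 3 \Bigl[\, q_i - \tfrac{2}{3} \sum_{i'=1}^{M} w(i,i')\, q_{i'} \Bigr].
\end{align*}
```

The leading factor of 3 acts as a larger overall penalty on $q_i$, while the target of the shrinkage is only two thirds of the weighted centroid.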

One of the similarity measures used by Zhen et al. [12] is the cosine similarity,

\[ w(i, i') = \frac{a_i^T a_{i'}}{\|a_i\| \, \|a_{i'}\|}. \tag{20} \]

Although other similarity measures can also be used, for binary attributes (see Section 3)
the cosine similarity has an intuitive appeal as it amounts to something easily
interpretable:

\[
w(i, i') = \frac{(\#\text{ attributes shared by } i \text{ and } i')}
{\sqrt{(\#\text{ attributes in } i)(\#\text{ attributes in } i')}}.
\tag{21}
\]
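For binary attribute vectors, the cosine similarity and its counting interpretation coincide, as the short sketch below illustrates. The function name `cosine_similarity` and the convention of returning zero when an item has no attributes are assumptions of the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two attribute vectors of equal length.

    For binary vectors this equals (# shared attributes) divided by
    sqrt((# attributes in a) * (# attributes in b)).
    Returns 0.0 if either vector has no attributes (an assumed convention).
    """
    shared = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return shared / denom if denom > 0 else 0.0
```

For example, two items with 3 and 2 attributes that share 2 of them have similarity 2 / sqrt(6).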



3.2 Regression-constrained factorization 



Another idea for incorporating content information stored in the matrix $A$ is to use a
regression-style constraint, forcing each item feature vector to be a function of the item's
content attributes, so that items with identical attributes are mapped to the same feature
vector. This method was first introduced by our group in a short conference paper [18].



3.2.1 Details 

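Specifically, the constraint is

\[ Q = AB, \tag{22} \]

where $B$ is a $D \times K$ matrix. Each column of $B$ behaves like a (vector) regression coefficient
that maps the items to a latent feature using their content attributes. Each row of $B$ can
be viewed as a $K$-dimensional latent feature vector for the corresponding attribute.
Under the constraint (22), the factorization (2) becomes

\[
R \approx PQ^T = P B^T A^T =
\underbrace{\begin{bmatrix} p_1^T \\ p_2^T \\ \vdots \\ p_N^T \end{bmatrix}}_{N \times K}
\underbrace{B^T}_{K \times D}
\underbrace{\begin{bmatrix} a_1 & a_2 & \cdots & a_M \end{bmatrix}}_{D \times M},
\]

and the optimization problem (5) becomes

\[
\min_{P, B} \; L_{RC}(P, B) = \sum_{(u, i) \in T} \left( r_{ui} - p_u^T B^T a_i \right)^2
+ \lambda \left( \sum_u \|p_u\|^2 + \gamma \|B\|^2 \right).
\tag{23}
\]

Again, the alternating gradient descent algorithm is applicable. The gradient of $L_{RC}$ with
respect to $p_u$, $\nabla_u^{RC}$, is the same as $\nabla_u^{BL}$ in Eq. (6), except that we replace $q_i$
with $B^T a_i$, i.e.,

\[ \nabla_u^{RC} \propto \sum_{i \in T_{u\cdot}} -\left( r_{ui} - p_u^T B^T a_i \right) B^T a_i + \lambda p_u. \tag{24} \]

Using the fact that $d(x^T M y)/dM = x y^T$, we can derive easily that the gradient of $L_{RC}$ with
respect to the matrix $B$ is

\[ \nabla_B^{RC} \propto \sum_{(u, i) \in T} -\left( r_{ui} - p_u^T B^T a_i \right) a_i p_u^T + \lambda \gamma B. \tag{25} \]

At iteration $(j + 1)$, the updating equations are:

\[ p_u^{(j+1)} = p_u^{(j)} - \eta \nabla_u^{RC}\left( p_u^{(j)}, B^{(j)} \right), \tag{26} \]

\[ B^{(j+1)} = B^{(j)} - \eta \nabla_B^{RC}\left( p_u^{(j)}, B^{(j)} \right). \tag{27} \]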
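One iteration of the regression-constrained updates (26)-(27) can be sketched as follows. The function name `rc_step`, the list-of-lists layout for P (N x K), B (D x K) and A (M x D), and the choice to evaluate both gradients at the current (P, B) before updating are assumptions of this sketch.

```python
def rc_step(ratings, P, B, A, lam, gamma, eta):
    """One simultaneous gradient update of P and B for the RC objective.

    ratings : dict mapping (u, i) -> r_ui for known pairs
    P       : N x K user latent vectors
    B       : D x K attribute latent vectors (rows)
    A       : M x D binary attribute matrix
    """
    D, K = len(B), len(B[0])

    def item_vector(i):
        # q_i = B^T a_i: the item vector implied by the item's attributes.
        return [sum(A[i][d] * B[d][k] for d in range(D)) for k in range(K)]

    # Penalty parts of the gradients (24) and (25).
    grad_P = [[lam * x for x in p] for p in P]
    grad_B = [[lam * gamma * x for x in row] for row in B]
    # Data parts, accumulated over known ratings.
    for (u, i), r in ratings.items():
        q = item_vector(i)
        err = r - sum(p * x for p, x in zip(P[u], q))
        for k in range(K):
            grad_P[u][k] -= err * q[k]
            for d in range(D):
                grad_B[d][k] -= err * A[i][d] * P[u][k]  # entry of a_i p_u^T
    new_P = [[x - eta * g for x, g in zip(p, gp)] for p, gp in zip(P, grad_P)]
    new_B = [[x - eta * g for x, g in zip(b, gb)] for b, gb in zip(B, grad_B)]
    return new_P, new_B
```

Because item vectors are never stored directly, two items with identical rows in A always receive the same latent vector, as the constraint requires.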

3.2.2 Related literature 

The idea of incorporating regression relationships into latent factor models also has a long
history. For example, ecologists used to apply a multivariate technique known as
correspondence analysis [30, 31] and fit so-called ordination models to sort species and
geographical sites with latent coordinates [e.g., 32]; sites with similar conditions would have
close-by coordinates, and likewise for species that prefer similar environments. Later, canonical
correspondence analysis (CCA) was introduced [33], which constrains the latent site coordinates
to be linear functions of actual environmental measurements at those sites. CCA has since
become an extremely popular technique in the field of environmental ecology [34].



4 Experiments 

In this section, we describe the data sets we used and the experiments we performed to
compare and evaluate various content-boosted MF algorithms against the baseline MF
algorithm. We use the acronyms BL, AB, gAB, TG, and RC to refer to the algorithms; these
acronyms should be self-evident from Sections 2 and 3. Table 1 briefly summarizes all the
algorithms being compared and studied.

4.1 The scaling factor $\gamma$

As we have alluded to earlier (Section 2.4), the purpose of the scaling factor $\gamma$ is to balance
the two penalties, the one on $\sum_u \|p_u\|^2$ and the other on $\sum_i \|q_i\|^2$ (or $\|B\|^2$ in the
case of RC), so that the objective function is not dominated by either the user or the item
side of the equation. Since the quantity $\sum_u \|p_u\|^2$ remained constant in this paper and the
algorithms differed only in terms of how they regularized the $q_i$'s, the use of $\gamma$ also allowed
us to compare all algorithms on the same scale. With this in mind, we used $\gamma = N/M$ for
(BL, AB, gAB) and $\gamma = N/D$ for RC. For TG, recall that, when the weights were normalized
to sum to one, every $\|q_i\|^2$ was multiplied by a factor of $1 + 2w_{i\cdot} = 3$ (see Section 3.1.4). In
order to compare everything on the same scale, we calibrated this extra factor of 3 back to
1 by choosing $\gamma = N/(3M)$ for TG.

Table 1: Summary of algorithms compared.

label   obj. func.              other details where applicable
                                $\gamma$     $c$    $w(i,i')$      $\theta$
BL      $L_{BL}$,  Eq. (5)      N/M      -    -            -
AB      $L_{AB}$,  Eq. (12)     N/M      1    -            -
gAB     $L_{gAB}$, Eq. (14)     N/M      1    Eq. (15)†    1
TG      $L_{TG}$,  Eq. (17)     N/(3M)   -    Eq. (20)†    -
RC      $L_{RC}$,  Eq. (23)     N/D      -    -            -

† The weights are normalized to sum to one for every $i$.
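The choices of the scaling factor can be collected in a small helper, sketched below with a hypothetical function name `scaling_factor`; the labels follow the acronyms used in the text.

```python
def scaling_factor(algorithm, N, M, D):
    """Scaling factor gamma for each algorithm label.

    N, M, D are the numbers of users, items and attributes, respectively.
    """
    if algorithm in ("BL", "AB", "gAB"):
        return N / M
    if algorithm == "TG":
        # Extra factor of 3 offsets the (1 + 2 w_i.) = 3 multiplier on ||q_i||^2.
        return N / (3 * M)
    if algorithm == "RC":
        return N / D
    raise ValueError("unknown algorithm label: %r" % (algorithm,))
```

For the "Movies" data described in Section 4.2 (N = 943, M = 1,682, D = 19), this gives 943/1682 for BL, AB and gAB, one third of that for TG, and 943/19 for RC.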



4.2 Data sets 



We used two data sets, "Recipes" and "Movies". The data set "Recipes" is a subset of
data crawled from http://allrecipes.com/ by Forbes and Zhu [18], including only recipes
rated by at least 90 users, and users who rated at least 50 recipes. The data set "Movies"
is the "MovieLens 100K" data set from http://www.grouplens.org/.



Table 2: Summary statistics for data sets.

                               Recipes    Movies
# of users, $N$                1,706      943
# of items, $M$                1,040      1,682
# of attributes, $D$           1,057      19
# of known ratings, $|T|$      64,941     100,000
density ratio, $|T|/(MN)$      3.7%†      6.3%

† Notice that this ratio would have been even lower had we used the full recipe data from [18].

For "Recipes", the ratings are integers between 0 and 5, and the binary attribute $a_{id}$ is
an indicator of whether recipe $i$ contains ingredient $d$. For "Movies", the ratings are integers
between 1 and 5, and $a_{id}$ is an indicator of whether movie $i$ belongs to genre $d$; notice that
the same movie can (and often does) belong to multiple genres. Table 2 contains summary
statistics about these two data sets.
4.3 Evaluation

To compare and evaluate different algorithms, we repeated the same experiment 15 times.
Each time, we sampled 50% of the user-item pairs $(u, i) \in T$ to serve as a hold-out validation
set, denoted by $T'$. Using the remaining 50% of the known ratings, we learned the matrices
$P$ and $Q$ (or $B$ in the case of RC) with different algorithms. Ratings for all $(u, i) \in T'$ were
predicted by $\hat{r}_{ui} = p_u^T q_i$ (or $\hat{r}_{ui} = p_u^T B^T a_i$ in the case of RC)^1, with proper truncation
if $\hat{r}_{ui}$ fell outside [0,5] (for "Recipes") or [1,5] (for "Movies"), and evaluated by the mean
absolute error (MAE) metric:

\[ \mathrm{MAE} = \frac{1}{|T'|} \sum_{(u, i) \in T'} \left| \hat{r}_{ui} - r_{ui} \right|. \]

Many researchers [e.g., 12, 28] have considered the MAE more appropriate for discrete
ratings, and the literature is increasingly favoring the use of the MAE as opposed to the root
mean squared error (RMSE), which dominated the Netflix contest. For each algorithm, we
examined factorizations of a few different dimensions, in particular, $K = 5, 10$ and $15$.
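The evaluation protocol can be sketched as below. The function names `split_holdout`, `truncate` and `mae`, the seeded shuffle used to draw the 50% hold-out set, and the representation of a fitted model as a callable `predict(u, i)` are all assumptions made for illustration.

```python
import random


def split_holdout(ratings, frac=0.5, seed=0):
    """Randomly split known ratings into (training set, hold-out set T')."""
    keys = sorted(ratings)           # fixed order so the seed fully determines the split
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_val = int(round(frac * len(keys)))
    val = {k: ratings[k] for k in keys[:n_val]}
    train = {k: ratings[k] for k in keys[n_val:]}
    return train, val


def truncate(x, lo, hi):
    """Clip a predicted rating to the valid range [lo, hi]."""
    return max(lo, min(hi, x))


def mae(predict, val, lo, hi):
    """Mean absolute error of truncated predictions over the hold-out set."""
    total = sum(abs(truncate(predict(u, i), lo, hi) - r)
                for (u, i), r in val.items())
    return total / len(val)
```

Repeating the split with different seeds and averaging the resulting MAE values would mirror the repeated-experiment design described above.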



4.4 Additional details for AB and gAB 

For both data sets, most items do not share any attribute in common; for those that do, the 
number of attributes shared is typically small (see Figure [2]). Thus, we chose c = 1 for AB 
and gAB, activating the alignment penalty as long as two items shared any attribute at all. 

Generally speaking, one can certainly regard c as an additional tuning parameter for AB, 
but if performance is measured with gross overall metrics such as the MAE or the RMSE, 
then the range of reasonable choices for c is fairly limited in our opinion. We think the best 
strategy is to choose c so that the alignment penalty is activated for a certain x% of the 
item-pairs, and the sensible range for x is somewhere between 10 and 50. If only a handful 
of item-pairs were subject to the alignment penalty, the overall MAE or RMSE would barely 
be affected. On the other hand, if more than half of the item-pairs were subject to such 
a penalty, items would almost certainly be shrunken blindly toward those with which they 
have little in common. The limited range of sensible values for x and the discrete nature of 
c often greatly restrict the choice of c. Take the "Movies" data set, for example. Choosing
c ≥ 2 would have resulted in x < 2.2, whereas choosing c = 0 would have resulted in x = 100
(by definition), so the only sensible choice remaining is c = 1, which gives x ≈ 35.
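
As a rough sketch of how one might tabulate x for each candidate c, the snippet below counts the percentage of distinct item pairs sharing at least c attributes. It assumes that the penalty is activated when the number of shared attributes is at least c (consistent with c = 0 giving x = 100); the toy indicator matrix and the function name are hypothetical.

```python
import numpy as np

def percent_pairs_activated(A, c):
    """Percent of distinct item pairs that share at least c attributes.

    A is an (items x attributes) 0/1 indicator matrix.
    """
    A = np.asarray(A, dtype=int)
    shared = A @ A.T                         # shared[i, j] = attributes common to items i and j
    iu = np.triu_indices(A.shape[0], k=1)    # each unordered pair (i < j) exactly once
    return 100.0 * float(np.mean(shared[iu] >= c))

# Toy data: 4 items, 3 attributes.
A_toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
])
x_by_c = {c: percent_pairs_activated(A_toy, c) for c in (0, 1, 2)}
```

One would then pick the value of c whose x falls inside the 10 to 50 range discussed above.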

As for gAB, it is clear that a large smoothing parameter θ will cause it to behave very much
like AB, whereas a small θ will essentially eliminate the effect of the alignment penalty. To
focus on main ideas rather than fine details, we only provide an illustration of this algorithm
using θ = 1.



4.5 Initialization 

We have already mentioned that, when both P and Q are unknown, the optimization problem 
(5) is not convex, which means the alternating gradient descent algorithm will give us local
solutions at best. Hence, a good initialization strategy is useful. 

Footnote: The predicted rating was actually $\hat{r}_{ui} + \mu + \alpha_u + \beta_i$; see Section 2.2.
[Figure 2: two histograms. Panel (a) Recipes, horizontal axis: number of ingredients shared. Panel (b) Movies, horizontal axis: number of genres shared.]

Figure 2: Distribution of the number of attributes shared by pairs of items.



4.5.1 SVD strategy 

For given K, one way to obtain reasonably good initial values of P and Q is as follows.
First, impute the missing entries of R with predictions from a certain rudimentary model
(more below), and call the resulting matrix $R_*$. Then, apply regular SVD and obtain the best
rank-K approximation to $R_*$:

$$R_* \approx P_* D_* Q_*^T.$$

Finally, initialize P with $P_{SVD} = P_* D_*^{1/2}$ and Q with $Q_{SVD} = Q_* D_*^{1/2}$. In practice, since
both $P^{(0)}$ and $Q^{(0)}$ then have mutually orthogonal columns, such an initialization strategy is often enough to
guard against degeneracy even though the optimization problem (5) is somewhat ill-posed
without explicit orthogonality constraints (see Section 2).

The ANOVA model (1) can be used as a rudimentary prediction model for imputing the
missing entries. But since the ANOVA model was actually removed prior to the application
of any matrix factorization techniques (Section 2.2), all imputed values should just be zero;
this would correspond to imputing the missing entries with predictions from the ANOVA
model before the normalization took place.
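
A minimal sketch of the SVD strategy, assuming the ratings have already been centered so that missing entries are imputed with zero. The toy matrix, the mask, and the name `svd_init` are hypothetical; users index rows and items index columns.

```python
import numpy as np

def svd_init(R_filled, K):
    """Return (P0, Q0) such that P0 @ Q0.T is the best rank-K approximation of R_filled."""
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    root = np.sqrt(s[:K])
    P0 = U[:, :K] * root        # P_* D_*^{1/2}
    Q0 = Vt[:K, :].T * root     # Q_* D_*^{1/2}
    return P0, Q0

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 5))              # toy centered ratings (users x items)
observed = rng.random((6, 5)) < 0.6      # toy mask of known ratings
R_filled = np.where(observed, R, 0.0)    # zero-impute the missing entries
P0, Q0 = svd_init(R_filled, K=2)
```

Splitting the singular values evenly between the two factors keeps P0 and Q0 on a comparable scale, which matters when both are penalized.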

It is easy to see that such an initialization strategy would be applicable to BL, AB, gAB,
and TG. For RC, however, an extra step would be required to obtain $B^{(0)}$ from $Q^{(0)}$. Since
the RC constraint is Q = AB, the most natural way to do so would be to initialize B with

$$B^{(0)} = (A^T A)^{-1} A^T Q^{(0)}, \qquad (28)$$

or, if D > M (in which case $A^T A$ would not be invertible),

$$B^{(0)} = (A^T A + \delta I)^{-1} A^T Q^{(0)} \qquad (29)$$

for some δ > 0. Our default choice was to set δ to the median value of the diagonal elements
in $A^T A$.
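
The extra step for RC amounts to a least-squares (or ridge) fit of the initial item features on the attribute matrix. The sketch below follows (28) and (29) under the assumption that A is M x D (items by attributes) and B is D x K; the function name and the toy matrices are hypothetical, and the plain least-squares branch additionally assumes that A has full column rank.

```python
import numpy as np

def init_B(A, Q0):
    """Initialize B for the RC constraint Q = A B from an initial Q0."""
    M, D = A.shape
    G = A.T @ A
    if D > M:
        delta = np.median(np.diag(G))       # default choice of delta described in the text
        G = G + delta * np.eye(D)           # ridge version, equation (29)
    return np.linalg.solve(G, A.T @ Q0)     # equation (28) when D <= M

# Case D <= M: 4 items, 3 attributes, A has full column rank.
A_tall = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
B_true = np.array([[1., 2.], [3., 4.], [5., 6.]])
B_tall = init_B(A_tall, A_tall @ B_true)    # recovers B_true since Q0 = A B_true exactly

# Case D > M: 2 items, 3 attributes, so the ridge term is needed.
A_wide = np.array([[1., 1., 0.], [0., 1., 1.]])
Q_wide = np.array([[1., 0.], [0., 1.]])
B_wide = init_B(A_wide, Q_wide)
```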
4.5.2 Mixed strategy



While practically useful on its own, the aforementioned SVD strategy posed a subtle problem
for comparison: it forced RC into a relative disadvantage. This is because, if $(P_{SVD}, Q_{SVD})$ is
a reasonably good initial factorization of R, then $(P_{SVD}, A B_{SVD})$ will not be as good, since

$$A B_{SVD} = A (A^T A)^{-1} A^T Q_{SVD}$$

is simply a projected version of $Q_{SVD}$. Figure 3 provides a geometric explanation of why this
is the case.

For a fair comparison of all algorithms, we therefore used a mixed strategy for initialization.
More specifically, the matrix P was initialized with

$$P^{(0)} = \kappa P_{SVD} + (1 - \kappa) P_{RANDOM},$$

where $P_{SVD}$ was obtained using the SVD strategy, and $P_{RANDOM}$ was a random matrix whose
elements were sampled independently from $N(0, \sigma^2)$. The same procedure was used to initialize
Q and/or B. For given K, the parameters κ and σ were chosen separately for (BL,
AB, gAB, TG) and for RC so that the initial factorizations yielded approximately the same
level of predictive performance for all the algorithms (see Figure 4).
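
A small sketch of the mixing step, assuming the initial matrix is the weighted combination κ·(SVD start) + (1 − κ)·(Gaussian noise with standard deviation σ). The function name `mixed_init` and the toy values of κ and σ are hypothetical; in practice these two parameters would be tuned per algorithm group as described above.

```python
import numpy as np

def mixed_init(X_svd, kappa, sigma, rng):
    """Blend an SVD-based starting matrix with N(0, sigma^2) noise (assumed mixing form)."""
    X_random = rng.normal(loc=0.0, scale=sigma, size=X_svd.shape)
    return kappa * X_svd + (1.0 - kappa) * X_random

rng = np.random.default_rng(0)
P_svd = np.arange(6, dtype=float).reshape(3, 2)   # stand-in for an SVD-based start
P_pure = mixed_init(P_svd, kappa=1.0, sigma=0.1, rng=rng)   # kappa = 1 keeps the SVD start
P_mixed = mixed_init(P_svd, kappa=0.7, sigma=0.1, rng=rng)  # partly randomized start
```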




Figure 3: A geometric explanation of why the SVD initialization strategy forces RC into a
relative disadvantage. In this illustration, $P_{SVD}$ is fixed; $Q_{SVD}$ gives the best factorization of
$R_*$; and anything other than $Q_{SVD}$ gives a worse factorization.



4.6 The choice of λ

Our mixed initialization strategy (Section 4.5.2), which ensures that the initial factorization
has approximately the same performance for all algorithms, and the way we have scaled the
penalty terms (Section 4.1), so that the penalty on $\sum_i \|q_i\|^2$ (or $\sum_d \|b_d\|^2$ in the case of RC)
is on the same order of magnitude as the penalty on $\sum_u \|p_u\|^2$ (a quantity that remains
constant for all algorithms), imply that, for the purpose of fair comparison, we could (and
should) use the same λ for all algorithms.

Table 3 lists the λ's we used for all the experiments. Our λ's increased with the
dimension (or rank) of the factorization, because more regularization was needed for factorization
models that contained more parameters. For any given K, larger λ's were used for
the "Movies" data set than for the "Recipes" data set because the "Recipes" data set was
more sparse, i.e., the ratio |T|/N was smaller (see Table 2). This meant that, in the case of
BL for example, the same level of regularization as measured by the ratio,

$$\lambda \left( \sum_u \|p_u\|^2 + \gamma \sum_i \|q_i\|^2 \right),$$

could be achieved with a smaller λ.

Table 3: The size of the penalty (λ) and the learning rate (η) used for different experiments.

              λ                 η (×10^-3)
 K     Movies  Recipes      Movies  Recipes
 5       25       8           2.0     2.0
10       50      12           1.0     1.5
15       75      16           0.5     1.0



4.7 Convergence criterion and the learning rate η

All algorithms were presumed to have reached convergence when the percent improvement
in their respective objective functions fell below a pre-specified threshold, that is, when

$$\frac{L^{(j)} - L^{(j+1)}}{L^{(j)}} < \epsilon. \qquad (30)$$

We used ε = 0.005 for all algorithms.

For gradient descent algorithms, it is well understood that η should be kept fairly small
to ensure that we are moving in a descent direction at each iteration. On the other hand,
for practical reasons (e.g., so that the algorithm does not take forever to finish running) we would
like to use the largest η feasible, one that still ensures that we are moving downhill.
For the convergence criterion (30), however, it was critical that the learning rate η did not
differ significantly for different algorithms. Suppose algorithm 1 used a relatively large η
and algorithm 2 used a relatively small one. Then, relative to algorithm 1, algorithm 2
could "converge" prematurely according to (30) simply because the small η did not allow its
objective function to change very much from iteration to iteration. Therefore, for any given
K, not only did we use the same λ for all algorithms, we also used the same η (see Table 3).
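
The premature-convergence effect can be seen on a toy problem. The sketch below runs plain gradient descent on a simple quadratic (a stand-in, not our actual objective) and stops when the relative improvement in the objective falls below ε = 0.005; the function name and the two learning rates are hypothetical, and the objective is assumed to stay positive so that the ratio is well defined.

```python
import numpy as np

def gradient_descent(loss, grad, x0, eta, eps=0.005, max_iter=10_000):
    """Gradient descent stopped when the relative improvement in the loss drops below eps."""
    x = np.asarray(x0, dtype=float)
    prev = loss(x)
    for it in range(1, max_iter + 1):
        x = x - eta * grad(x)
        curr = loss(x)
        if (prev - curr) / prev < eps:   # relative-improvement stopping rule
            return x, it
        prev = curr
    return x, max_iter

def toy_loss(x):
    return float(np.sum(x ** 2) + 1.0)   # minimum value 1.0 at x = 0

def toy_grad(x):
    return 2.0 * x

x_large, iters_large_eta = gradient_descent(toy_loss, toy_grad, [3.0, -2.0], eta=0.1)
x_small, iters_small_eta = gradient_descent(toy_loss, toy_grad, [3.0, -2.0], eta=0.0001)
```

With the tiny learning rate, the very first step already changes the objective by less than 0.5%, so the run is declared converged while still far from the minimum.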

4.8 Results 

Figure 4 summarizes our experimental results. We can see that, starting with initial factorizations
of roughly the same quality and using the same level of regularization (as controlled
by γ and λ), the same learning rate (η), and the same convergence criterion (30), the
content-boosted algorithms (AB, gAB, TG, RC) generally had lower MAEs than the baseline
algorithm (BL). The performance of TG appears to trail behind that of similar algorithms in
the same class (i.e., AB, gAB). We think this is due to its much dampened shrinkage effect
(Section [3X1D. 



5 Discussion 

We now discuss useful by-products of these content-boosted matrix-factorization techniques.

5.1 More interpretable recommendations 

By explicitly pulling "similar" items together in the latent feature space, where "similarity" 
is defined by the contents of the items, the alignment-biased algorithms (AB, gAB, TG) 
produce recommendations that are easier to explain. Research has shown that the "why"
dimension of recommendation, i.e., the ability "to reason to the user why certain recommendations
are presented" [19], improves the effectiveness of the recommender system, especially
as measured by the conversion rate [20].



To illustrate, we selected a number of movies from a few distinct genres (e.g., thriller, sci-fi),
as well as a number of recipes from a few different categories (e.g., soup, pasta, cookie),
and plotted their latent feature vectors $q_i \in \mathbb{R}^K$ from BL and from AB, using the first two
principal components (Figure 5). Here, we chose to illustrate the 5-dimensional solutions
because showing higher-dimensional solutions in 2D would have created more distortion.
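
The projection onto the two leading principal components can be sketched as follows. The random item matrix is a hypothetical stand-in for a learned Q with K = 5, and centering each latent dimension before the SVD is an assumption about how the components are computed.

```python
import numpy as np

def project_to_2d(Q):
    """Project the rows of Q onto their two leading principal components."""
    Qc = Q - Q.mean(axis=0)                            # center each latent dimension
    _, _, Vt = np.linalg.svd(Qc, full_matrices=False)  # rows of Vt are principal directions
    return Qc @ Vt[:2].T                               # columns are PC1 and PC2 scores

rng = np.random.default_rng(0)
Q_toy = rng.normal(size=(12, 5))    # 12 hypothetical items, K = 5
coords = project_to_2d(Q_toy)       # one (PC1, PC2) point per item, ready for a scatter plot
```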

As expected, recipes containing common ingredients (e.g., "Greek chicken pasta" and
"sesame pasta chicken salad") have been pulled closer together by the alignment-biased
algorithm. The two chicken soups are closer to each other. The dish, "apple stuffed chicken
breast", is now closer to chicken pastas than to apple desserts. On the other hand, "oatmeal
raisin cookies" are pulled away from the two "chocolate-chip cookies" because the key
ingredients are different.

Likewise, movies belonging to the same genres are now closer to each other, e.g., "Interview
with Vampire" and "Scream", both thrillers. The same can be said about the three
children's movies and the three science-fiction movies. Clearly, the coordinate maps produced by
the alignment-biased algorithm, AB, are much easier to explain to consumers.

5.2 Measure of content similarity 

The regression-constrained algorithm (RC) allows us to compute the similarity of two content
attributes, d and d', using their latent feature vectors, e.g.,

$$\cos(d, d') = \frac{b_d^T b_{d'}}{\|b_d\| \, \|b_{d'}\|}, \qquad (31)$$

where $b_d$ is the d-th row of the matrix B. Notice that, as a measure of similarity, (31) is not
based on the simple notion of co-occurrence (merely counting how often two attributes are
shared by the same item), since B is driven by both content attributes and user preferences.
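
For concreteness, the cosine similarity between two attributes, computed from the corresponding rows of B, can be sketched as below. The matrix `B_toy` and the function name are hypothetical; a learned B from RC would take the place of the toy matrix.

```python
import numpy as np

def attribute_cosine(B, d, d_prime):
    """Cosine of the angle between rows d and d_prime of B."""
    b, b2 = B[d], B[d_prime]
    return float(b @ b2 / (np.linalg.norm(b) * np.linalg.norm(b2)))

# Toy B with 4 attributes and K = 2 latent dimensions.
B_toy = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0],
                  [-1.0, 0.0]])
cos_similar = attribute_cosine(B_toy, 0, 1)      # same direction
cos_unrelated = attribute_cosine(B_toy, 0, 2)    # orthogonal directions
cos_opposite = attribute_cosine(B_toy, 0, 3)     # opposite directions
```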
[Figure 4: six panels of MAEs by algorithm (horizontal axis of each panel: BL, AB, gAB, TG, RC). Panels: (a) Recipes (K=5); (b) Recipes (K=10); (c) Recipes (K=15); (d) Movies (K=5); (e) Movies (K=10); (f) Movies (K=15).]

Figure 4: Mean absolute errors (MAEs) on hold-out validation sets from 15 repeated runs.
For each run, the data set was randomly split into a training set and a validation set (see
Section 4.3). The inverted triangles (▽) on the top indicate the average MAEs on the
validation set using the initial values for each respective algorithm, i.e., $P^{(0)}$ and $Q^{(0)}$ (or
$B^{(0)}$ in the case of RC). These are shown here to emphasize the fact that all algorithms were
started with initial values of approximately the same quality, so that our overall comparison
is fair (see Section 4.5.2). Notice the broken vertical axes.
[Figure 5: four scatter plots of item feature vectors (horizontal axis: PC1). Panels: (a) Recipes (BL); (b) Recipes (AB); (c) Movies (BL); (d) Movies (AB). Recipe legend: cookies (Ck), chicken soup (Chk Sp), with apples (appl), chicken pasta (ChkP). Movie legend: children, thriller, sci-fi, crime.]

Figure 5: Feature vectors for selected items: 5-dimensional matrix factorization solutions
projected onto 2 leading principal components for 2D display. BL = "baseline" algorithm
(Section 2); AB = "alignment-biased" algorithm (Section 3.1).
Table 4 shows a few examples from both data sets, for pairs of ingredients and genres
ranging from being highly similar (cos ≫ 0) to being highly dissimilar (cos ≪ 0). All results
in this table are based on K = 15. It is well-known that high-dimensional vectors are more
likely to be orthogonal (cos ≈ 0) than low-dimensional ones. We chose to calculate (31)
using relatively high-dimensional feature vectors so that cosine-values far away from zero
were more meaningful.

Some of these pairs are not too surprising. For example, it is easy to see that people
who like "Thai chili sauce" would also like "jalapeno peppers" (Table 4a), both spicy
ingredients. Likewise, we are hardly amazed that those who like "crime" movies will probably
also like "horror" movies, and that the genre "children" goes much better with "adventure"
than with "documentary" (cos ≈ 0.74 > 0 vs. cos ≈ −0.52 < 0; Table 4b).

Other pairs, however, are much less obvious. For example, Table 4(b) shows that users
who like "war" movies are more likely to favor "animation" movies over "action" movies
(cos ≈ 0.34 > 0 vs. cos ≈ −0.21 < 0). Similarly, Table 4(a) tells us that users who like
"smoked ham" will probably also like "chocolate mint wafer candy" and that, if a user
likes "cottage cheese", he or she may detest "Swiss cheese". This kind of insight about
the contents is a unique by-product of the regression-constrained algorithm, and some of
these novel insights can be commercially useful. For example, Table 4(a) suggests that "firm
tofu" might be used to replace "mozzarella" in some recipes; if you are familiar with both
ingredients, you may very well appreciate that this is not a bad idea at all.

6 Summary and discussion 

In this paper, we have focused on different ways to incorporate content information directly
into the matrix-factorization approach for collaborative filtering. Our methodology consists
of imposing either an "alignment penalty" (Section 3.1), effectively shrinking items that
share common attributes toward each other, or a regression-style constraint (Section 3.2),
forcing the latent item-features to be functions of content attributes. Experiments with two
data sets have shown that these content-boosted algorithms can not only achieve better
recommendation accuracy, but also produce novel, commercially useful insights about
the contents themselves, as well as more interpretable recommendations.

Our treatment of the problem is by no means thorough. For example, it is certainly possible
to envision different types of penalties and constraints, and we have not yet attempted
to study the theoretical properties of these different approaches. This is a rich area with
many opportunities for continued research. We hope that our paper has not only outlined a
few useful ideas for practitioners, but also made it easier for researchers to think about this
type of problem in a more systematic manner.
Table 4: Selected pairs of attributes and their cosine similarity (31), based on their latent
feature vectors in $\mathbb{R}^{15}$.



(a) Recipes 









Thai chili sauce or hot sauce
almonds